May 14th 2020

Project requirements

Follow the IMRAD standard scientific structure:

  • Introduction
  • Materials and Methods
  • Results (And)
  • Discussion

Keep a technical focus, but be sure to communicate whichever biological insights you arrived at.

The presentation should not include all your code (we will look into that at the individual examinations). Focus instead on the broader picture of what you did, and include data summaries and visualisations.

Created as an rmarkdown ioslides_presentation (i.e. the right-most doc column in the project organisation will be an rmarkdown-based presentation)

Introduction

  • Intro to snake venom
  • Data set for the study:
    • Venom compositions from snakes all around the world
  • Goal of study:
    • Group snakes by genus based on venom composition (PCA, K-means, ANN)

The datasets

  • Main data
## # A tibble: 242 x 100
##   Snake Reference Note  `SVMP (Snake Ve… `PI-SVMP (Snake… `PII-SVMP (Snak…
##   <chr> <chr>     <chr>            <dbl>            <dbl>            <dbl>
## 1 Agki… https://… Mexi…             24.5                0                0
## 2 Agki… https://… Cost…             30.8                0                0
## 3 Agki… https://… Mexi…             30.6                0                0
## 4 Agki… https://… Orig…             32.5                0                0
## # … with 238 more rows, and 94 more variables: `PIII-SVMP (Snake Venom
## #   Metalloproteinase PIII), %` <dbl>, …
  • New data
## # A tibble: 27 x 4
##   Toxin               `Vipera aspis asp… `Vipera berus ber… `Vipera anatolica s…
##   <chr>                            <dbl>              <dbl>                <dbl>
## 1 SVMP (Snake Venom …               13.4                 NA                 42.9
## 2 3Ftx (three-finger…               NA                   NA                 NA  
## 3 Unknown peptides                  NA                   NA                 23.5
## 4 PLA2 (Phospholipas…               30.9                 NA                  8.2
## # … with 23 more rows
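
The two tables have opposite orientations: the main data has one row per snake, while the new data has one row per toxin and one column per snake. A minimal Python sketch of the reshaping needed before merging is given here; the project itself is written in R, so this is an illustration and not the project code. The shortened snake and toxin names, the helper `to_snake_rows`, and the choice to fill NA with 0 are all assumptions.

```python
# Hypothetical miniature of the "new data" table: one row per toxin,
# one column per snake (None stands in for NA). Names are shortened
# for illustration; the values are the ones visible in the preview.
new_data = {
    "Toxin": ["SVMP", "3Ftx", "PLA2"],
    "Vipera aspis": [13.4, None, 30.9],
    "Vipera anatolica": [42.9, None, 8.2],
}


def to_snake_rows(table, fill=0.0):
    """Transpose a toxin-per-row table into one dict per snake.

    Missing values are replaced by `fill`. Treating NA as 0 % is an
    assumption made for this sketch, not something the slides state.
    """
    toxins = table["Toxin"]
    rows = []
    for snake, values in table.items():
        if snake == "Toxin":
            continue
        row = {"Snake": snake}
        for toxin, value in zip(toxins, values):
            row[toxin] = fill if value is None else value
        rows.append(row)
    return rows


snake_rows = to_snake_rows(new_data)
```

After this step each entry in `snake_rows` has the same shape as a row of the main data (a `Snake` key plus one key per toxin), so the two sources can be concatenated.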

Materials and methods

  • Loading and cleaning data
    • Merge datasets
    • Map locations to country
  • Augmentation of data
    • Group snake subspecies and venom types
  • Initial Analysis and visualisations
    • Venom composition
    • Geographical distribution
  • Unsupervised analysis
    • PCA
    • K-means clustering
  • Supervised classification model
    • Artificial Neural Network (ANN)
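
The K-means step listed above can be sketched in a few lines of Python. This is a plain Lloyd's algorithm on made-up composition vectors, assuming Euclidean distance and a fixed initialisation; it is not the project's R code, and the function name `kmeans` and the example values are hypothetical.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical compositions (% SVMP, % 3FTx, % PLA2), made up for illustration.
venoms = [
    (30.0, 0.0, 20.0),  # SVMP-rich
    (2.0, 70.0, 15.0),  # 3FTx-rich
    (35.0, 1.0, 25.0),  # SVMP-rich
    (1.0, 65.0, 20.0),  # 3FTx-rich
]
labels, centroids = kmeans(venoms, k=2)
```

Taking the first k points as initial centroids keeps the sketch deterministic; real analyses usually use several random restarts.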

Cleaning

## # A tibble: 124 x 1
##    Note          
##    <chr>         
##  1 Mexico        
##  2 Costa Rica    
##  3 Origin unknown
##  4 USA           
##  5 Texas         
##  6 Kentucky      
##  7 Missouri      
##  8 Florida       
##  9 Australia     
## 10 Costa rica    
## # … with 114 more rows
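
The Note column mixes countries, US states, inconsistent capitalisation ("Costa Rica" vs. "Costa rica") and unknown origins. A minimal Python sketch of how such values could be mapped to a country follows; the project itself uses R, and the lookup set `US_STATES` and the helper `note_to_country` are assumptions that cover only the values visible above.

```python
# Assumed lookup of US state names; only the states visible in the
# Note column preview are listed.
US_STATES = {"Texas", "Kentucky", "Missouri", "Florida"}


def note_to_country(note):
    """Normalise a free-text Note value to a country name (or None)."""
    cleaned = " ".join(note.split())  # collapse stray whitespace
    if cleaned.lower() == "origin unknown":
        return None
    if cleaned.upper() == "USA" or cleaned.title() in US_STATES:
        return "USA"
    return cleaned.title()  # "Costa rica" -> "Costa Rica"


notes = ["Mexico", "Costa Rica", "Origin unknown", "USA", "Texas", "Costa rica"]
countries = [note_to_country(n) for n in notes]
```

A full mapping would need a complete table of regions and spelling variants; title-casing alone is not enough for every country name.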

Results from cleaning and augmenting the data

  • Show dirty data vs. clean data
  • Show region, family, genus, species rows
  • Show grouping of the toxin families columns

Augmented data

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Snake = col_character(),
##   Genus = col_character(),
##   Species = col_character(),
##   Reference = col_character(),
##   Country = col_character(),
##   Continent = col_character(),
##   Family = col_character()
## )
## See spec(...) for full column specifications.
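
The Genus and Species columns can be derived from the Snake name. The Python sketch below shows one way to do this, assuming that Snake holds a binomial or trinomial name and that Species holds only the species epithet; both are assumptions, and `split_name` is a hypothetical helper, not the project's R code.

```python
def split_name(snake):
    """Split a binomial or trinomial name into genus and species.

    A trailing subspecies epithet is dropped, so subspecies of the same
    species end up in the same (Genus, Species) group.
    """
    parts = snake.split()
    genus = parts[0]
    species = parts[1] if len(parts) > 1 else None
    return {"Snake": snake, "Genus": genus, "Species": species}


rows = [split_name(s) for s in ["Daboia russelii russelii", "Naja kaouthia"]]
```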

World map

Snake family count

Snake family

Most abundant toxins

Compare two snakes

Intra species comparison

Shiny app

Results from PCA and K-means

Prediction models based on venom composition

  • A simple vanilla ANN predicting snake family correctly classified the whole test set (25 % of the data)
    • Specifications: 4 hidden neurons, learning rate = 0.001, n_epochs = 100, loss criterion = binary cross-entropy
  • Prediction of the continent the snake originated from
    • Tried a number of architectures, ranging from 1 to 4 hidden layers, with and without dropout, and tried optimizing hyperparameters.
    • Problems with overfitting; further regularization, e.g. early stopping, might be a solution
    • The question might be ill-posed: venom composition predicts snake family and not necessarily location. E.g. snakes from two different families from the same country have completely different venom compositions.
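
Early stopping, mentioned above as a possible remedy for overfitting, halts training once the validation loss stops improving. A minimal Python sketch of the stopping rule follows; the `patience` and `min_delta` parameters and the validation losses are hypothetical and do not come from the project.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: improvement until epoch 3, then overfitting.
val_losses = [0.70, 0.55, 0.48, 0.47, 0.49, 0.52, 0.56, 0.60]
stop = early_stopping_epoch(val_losses, patience=3)
```

In practice the model weights from the best epoch (here epoch 3) would be restored after stopping.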

Training of ANN predicting snake family
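
For orientation, the forward pass and loss of a network with one hidden layer of 4 neurons and a binary cross-entropy criterion can be sketched in plain Python. The sigmoid activations, the three input features and the zero weights are assumptions made for illustration; the slides do not state the activation function or the input dimension, and the training loop is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with sigmoid activations and a sigmoid output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# 4 hidden neurons, 3 hypothetical input features, all weights zero.
w_hidden = [[0.0, 0.0, 0.0] for _ in range(4)]
b_hidden = [0.0] * 4
w_out = [0.0] * 4
b_out = 0.0

batch = [(30.0, 0.0, 20.0), (2.0, 70.0, 15.0)]  # made-up compositions
targets = [1, 0]                                # made-up binary labels
preds = [forward(x, w_hidden, b_hidden, w_out, b_out) for x in batch]
loss = binary_cross_entropy(targets, preds)
```

With all weights at zero every prediction is 0.5, so the loss equals ln 2; training would lower it by updating the weights with gradient steps scaled by the learning rate.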

Analysis of incorrect labels (1)

  • If test size is increased to 40 %, the model misclassifies 5 snakes as illustrated below:

Analysis of incorrect labels (2)

Analysis of the venom composition of the incorrectly labeled snakes:

## # A tibble: 5 x 2
##   Snake                    Family   
##   <chr>                    <chr>    
## 1 Daboia russelii russelii Viperidae
## 2 Hydrophis cyanocinctus   Elapidae 
## 3 Micropechis ikaheka      Elapidae 
## 4 Naja kaouthia            Elapidae 
## 5 Naja kaouthia            Elapidae